This tutorial is derived heavily from the following sources:
Please give them a watch for a more detailed understanding.
By the end of this tutorial, you should be able to
To learn more, follow them on social media:
import numpy as np
# Creating an array
x = np.zeros(shape=(10,10), dtype='f4')
x.shape, x.dtype
((10, 10), dtype('float32'))
# How much memory does the array use?
x.nbytes
400
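That 400 is just shape times item size: a `float32` element takes 4 bytes, so 10 × 10 × 4 = 400. The arithmetic can be verified directly:

```python
import numpy as np

x = np.zeros(shape=(10, 10), dtype='f4')
# nbytes is the number of elements times the per-element size
assert x.nbytes == x.size * x.dtype.itemsize == 10 * 10 * 4
print(x.nbytes)  # 400
```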
# Accessing data in the array
x[:5, :5]
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]], dtype=float32)
y = np.ones(shape=(20,30), dtype='f4')
y[:3,:3]
array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]], dtype=float32)
y[:10,:10] = x
y[:3, :3]
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]], dtype=float32)
y[-3:, -3:]
array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]], dtype=float32)
Fundamental properties of Zarr arrays:
import zarr
z = zarr.create(shape=(60, 80), dtype='f4', chunks=(10,10), store='test.zarr')
z
<zarr.core.Array (60, 80) float32>
z.info
| Type | zarr.core.Array |
|---|---|
| Data type | float32 |
| Shape | (60, 80) |
| Chunk shape | (10, 10) |
| Order | C |
| Read-only | False |
| Compressor | Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) |
| Store type | zarr.storage.DirectoryStore |
| No. bytes | 19200 (18.8K) |
| No. bytes stored | 337 |
| Storage ratio | 57.0 |
| Chunks initialized | 0/48 |
z.fill_value
0.0
z[10, 12]
0.0
z[:] = 42
z.info
| Type | zarr.core.Array |
|---|---|
| Data type | float32 |
| Shape | (60, 80) |
| Chunk shape | (10, 10) |
| Order | C |
| Read-only | False |
| Compressor | Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) |
| Store type | zarr.storage.DirectoryStore |
| No. bytes | 19200 (18.8K) |
| No. bytes stored | 2497 (2.4K) |
| Storage ratio | 7.7 |
| Chunks initialized | 48/48 |
We can attach arbitrary metadata to an array via its attributes.
z.attrs['units'] = 'degC'
dict(z.attrs)
{'units': 'degC'}
z.store
<zarr.storage.DirectoryStore at 0x112f33850>
!tree -a test.zarr | head
test.zarr
├── .zarray
├── .zattrs
├── 0.0
├── 0.1
├── 0.2
├── 0.3
├── 0.4
├── 0.5
├── 0.6
import json
with open("test.zarr/.zarray") as fp:
display(json.load(fp))
{'chunks': [10, 10],
'compressor': {'blocksize': 0,
'clevel': 5,
'cname': 'lz4',
'id': 'blosc',
'shuffle': 1},
'dtype': '<f4',
'fill_value': 0.0,
'filters': None,
'order': 'C',
'shape': [60, 80],
'zarr_format': 2}
import json
with open("test.zarr/.zattrs") as fp:
display(json.load(fp))
{'units': 'degC'}
Chunking is the main parameter we control as users when creating or working with Zarr arrays. The choice of chunks can have a significant impact on performance. There are two main points to consider regarding chunking:
Let's compare a couple of chunking strategies:
a = zarr.create(shape=(100, 100, 100), chunks=(1, 100, 100), dtype='f8', store="a.zarr")
a[:] = np.random.randn(*a.shape)
%time _ = a[:, 0, 0]
CPU times: user 12 ms, sys: 6.54 ms, total: 18.5 ms
Wall time: 17.6 ms
b = zarr.create(shape=(100, 100, 100), chunks=(100, 100, 1), dtype='f8', store="b.zarr")
b[:] = np.random.randn(*b.shape)
%time _ = b[:, 0, 0]
CPU times: user 1.2 ms, sys: 997 µs, total: 2.2 ms
Wall time: 1.51 ms
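The gap makes sense once we count chunks: `a[:, 0, 0]` crosses every one of the 100 chunks along the first axis of `a`, while the same selection on `b` sits entirely inside a single chunk. The arithmetic can be sketched in plain Python (no I/O; `chunks_touched` is a helper invented here for illustration):

```python
def chunks_touched(chunks, sel):
    """Count how many chunks a selection touches.
    chunks: per-axis chunk sizes; sel: per-axis (start, stop) index ranges."""
    total = 1
    for (start, stop), c in zip(sel, chunks):
        # chunks hit on this axis: from the chunk containing `start`
        # through the chunk containing `stop - 1`, inclusive
        total *= (stop - 1) // c - start // c + 1
    return total

# a[:, 0, 0] with chunks (1, 100, 100): one chunk per index along axis 0
print(chunks_touched((1, 100, 100), [(0, 100), (0, 1), (0, 1)]))   # 100
# b[:, 0, 0] with chunks (100, 100, 1): the whole slice is in one chunk
print(chunks_touched((100, 100, 1), [(0, 100), (0, 1), (0, 1)]))   # 1
```

Every touched chunk must be read and decompressed in full, so the first layout does roughly 100× the work for this access pattern.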
Transforming from one chunking strategy to another is not a trivial problem. Refer to this super useful package for help with this particular problem.
import uuid
my_folder = f"s3://wxml/kushal_test/{uuid.uuid4().hex}"
my_folder
's3://wxml/kushal_test/0ec1566eaa5c4331b0cff11d3d179f3d'
target = f"{my_folder}/test.zarr"
store = zarr.storage.FSStore(target)
group = zarr.group(store=store)
group.create(name="foo", shape=(100, 100), chunks=(10, 10), dtype='f4')
group.create(name="baz", shape=(100, 100), chunks=(20, 20), dtype='i4')
group
<zarr.hierarchy.Group '/'>
group.foo[:] = np.random.rand(*group.foo.shape)
group.foo.info
| Name | /foo |
|---|---|
| Type | zarr.core.Array |
| Data type | float32 |
| Shape | (100, 100) |
| Chunk shape | (10, 10) |
| Order | C |
| Read-only | False |
| Compressor | Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) |
| Store type | zarr.storage.FSStore |
| No. bytes | 40000 (39.1K) |
| No. bytes stored | 41843 (40.9K) |
| Storage ratio | 1.0 |
| Chunks initialized | 100/100 |
import xarray as xr
import hvplot.xarray as hvx
ds = xr.tutorial.open_dataset("air_temperature")
ds
<xarray.Dataset>
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 ...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
ds.air.hvplot(x='lon', y='lat', cmap='magma')
ds_chunked = ds.chunk({'time': 100})
ds_chunked
<xarray.Dataset>
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 dask.array<chunksize=(100, 25, 53), meta=np.ndarray>
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
path = f"{my_folder}/air_temp.zarr"
from dask.diagnostics import ProgressBar
with ProgressBar():
ds_chunked.to_zarr(path)
/Users/kushal/Downloads/Analysis_tools/mambaforge/envs/metenv/lib/python3.10/site-packages/xarray/core/dataset.py:2105: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
  return to_zarr(  # type: ignore[call-overload,misc]
[########################################] | 100% Completed | 6.51 s
path
's3://wxml/kushal_test/0ec1566eaa5c4331b0cff11d3d179f3d/air_temp.zarr'
ds_from_s3 = xr.open_dataset(path, engine="zarr", chunks='auto')
ds_from_s3
<xarray.Dataset>
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 dask.array<chunksize=(2920, 25, 53), meta=np.ndarray>
Attributes:
Conventions: COARDS
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
url = "s3://cmip6-pds/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/historical/r1i1p1f1/day/pr/gr1/v20180701/"
ds2 = xr.open_dataset(url, engine='zarr', backend_kwargs={'storage_options': {'anon': True}})
ds2
<xarray.Dataset>
Dimensions: (lat: 180, bnds: 2, lon: 288, time: 60225)
Coordinates:
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
lat_bnds (lat, bnds) float64 ...
* lon (lon) float64 0.625 1.875 3.125 4.375 ... 355.6 356.9 358.1 359.4
lon_bnds (lon, bnds) float64 ...
* time (time) object 1850-01-01 12:00:00 ... 2014-12-31 12:00:00
time_bnds (time, bnds) object ...
Dimensions without coordinates: bnds
Data variables:
pr (time, lat, lon) float32 ...
Attributes: (12/49)
Conventions: CF-1.7 CMIP-6.0 UGRID-1.0
activity_id: CMIP
branch_method: standard
branch_time_in_child: 0.0
branch_time_in_parent: 36500.0
comment: <null ref>
... ...
variable_id: pr
variant_info: N/A
variant_label: r1i1p1f1
status: 2019-09-17;created;by nhn2@columbia.edu
netcdf_tracking_ids: hdl:21.14100/d4ce73dd-d8e0-44ef-847a-b957a138daf6...
version_id: v20180701